    Stratification bias in low signal microarray studies

    BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
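The difference between pooling test samples and averaging per-fold estimates can be illustrated with a small sketch. It is not the experimental setup of the paper: the signal-free scorer (every test sample receives the positive-class rate of its training set), the function names `auc` and `cross_validate`, and the 20/20 class split are assumptions chosen only to make the negative correlation between training and test class proportions visible.

```python
import random


def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; ties count as half a win.

    Returns None when only one class is present, since AUC is undefined then.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validate(labels, k, seed=0):
    """Unstratified k-fold CV with a signal-free scorer (hypothetical setup).

    Every test sample is scored with the positive-class rate of its training
    set, so the scores carry no information about individual samples.
    Returns (pooled AUC, mean of per-fold AUCs).
    """
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]

    pooled_scores, pooled_labels, fold_aucs = [], [], []
    for fold in folds:
        held_out = set(fold)
        train = [labels[i] for i in idx if i not in held_out]
        test = [labels[i] for i in fold]
        rate = sum(train) / len(train)  # training-set positive rate
        scores = [rate] * len(test)

        pooled_scores += scores
        pooled_labels += test
        fold_auc = auc(scores, test)
        if fold_auc is not None:  # skip single-class folds
            fold_aucs.append(fold_auc)

    pooled = auc(pooled_scores, pooled_labels)
    averaged = sum(fold_aucs) / len(fold_aucs) if fold_aucs else None
    return pooled, averaged


labels = [1] * 20 + [0] * 20
print("10-fold:", cross_validate(labels, k=10))
print("leave-one-out:", cross_validate(labels, k=len(labels)))
```

In this sketch a fold that happens to hold more positives leaves fewer positives for training, so its samples receive a lower score. With equal-sized folds the pooled AUC therefore cannot exceed 0.5, while every per-fold AUC is exactly 0.5 because all scores within a fold are tied. Under leave-one-out the effect is extreme: each held-out positive scores lower than each held-out negative, giving a pooled AUC of 0, and the per-fold average is undefined because every fold contains a single sample.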

    Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation

    We present an investigation of recently proposed character and word sequence kernels for the task of authorship attribution based on relatively short texts. Performance is compared with two corresponding probabilistic approaches based on Markov chains. Several configurations of the sequence kernels are studied on a relatively large dataset (50 authors), where each author covered several topics. Utilising Moffat smoothing, the two probabilistic approaches obtain similar performance, which in turn is comparable to that of character sequence kernels and is better than that of word sequence kernels. The results further suggest that when using a realistic setup that takes into account the case of texts which are not written by any of the hypothesised authors, the amount of training material has more influence on discrimination performance than the amount of test material. Moreover, we show that the recently proposed author unmasking approach is less useful when dealing with short texts.

    Investigating the Encoding of Words in BERT's Neurons using Feature Textualization

    Pretrained language models (PLMs) form the basis of most state-of-the-art NLP technologies. Nevertheless, they are essentially black boxes: Humans do not have a clear understanding of what knowledge is encoded in different parts of the models, especially in individual neurons. The situation is different in computer vision, where feature visualization provides a decompositional interpretability technique for neurons of vision models. Activation maximization is used to synthesize inherently interpretable visual representations of the information encoded in individual neurons. Our work is inspired by this but presents a cautionary tale on the interpretability of single neurons, based on the first large-scale attempt to adapt activation maximization to NLP, and, more specifically, large PLMs. We propose feature textualization, a technique to produce dense representations of neurons in the PLM word embedding space. We apply feature textualization to the BERT model (Devlin et al., 2019) to investigate whether the knowledge encoded in individual neurons can be interpreted and symbolized. We find that the produced representations can provide insights about the knowledge encoded in individual neurons, but that individual neurons do not represent clearcut symbolic units of language such as words. Additionally, we use feature textualization to investigate how many neurons are needed to encode words in BERT. Comment: To be published in 'BlackboxNLP 2023: The 6th Workshop on Analysing and Interpreting Neural Networks for NLP'. Camera-ready version.

    Genetic Diversity, Latency and Co-Infections

    Alphaherpesviruses are highly prevalent in equine populations, and co-infections with more than one of these viruses' strains are frequently diagnosed. Lytic replication and latency with subsequent reactivation, along with new episodes of disease, can be influenced by genetic diversity generated by spontaneous mutation and recombination. Latency enhances virus survival by providing an epidemiological strategy for long-term maintenance of divergent strains in animal populations. The alphaherpesviruses equine herpesvirus 1 (EHV-1) and 9 (EHV-9) have recently been shown to cross species barriers, including a recombinant EHV-1 observed in fatal infections of a polar bear and Asian rhinoceros. Little is known about the latency and genetic diversity of EHV-1 and EHV-9, especially among zoo and wild equids. Here, we report evidence of limited genetic diversity in EHV-9 in zebras, whereas there is substantial genetic variability in EHV-1. We demonstrate that zebras can be lytically and latently infected with both viruses concurrently. Such a co-occurrence of infection in zebras suggests that even relatively slow-evolving viruses such as equine herpesviruses have the potential to diversify rapidly by recombination. This has potential consequences for the diagnosis of these viruses and their management in wild and captive equid populations.

    Quantitative Disentanglement of the Spin Seebeck, Proximity-Induced, and Ferromagnetic-Induced Anomalous Nernst Effect in Normal-Metal-Ferromagnet Bilayers

    We identify and investigate thermal spin transport phenomena in sputter-deposited Pt/NiFe$_2$O$_{4-x}$ ($4 \geq x \geq 0$) bilayers. We separate the voltage generated by the spin Seebeck effect from the anomalous Nernst effect contributions and even disentangle the intrinsic anomalous Nernst effect (ANE) in the ferromagnet (FM) from the ANE produced by the Pt that is spin polarized due to its proximity to the FM. Further, we probe the dependence of these effects on the electrical conductivity and the band gap energy of the FM film varying from nearly insulating NiFe$_2$O$_4$ to metallic Ni$_{33}$Fe$_{67}$. A proximity-induced ANE could only be identified in the metallic Pt/Ni$_{33}$Fe$_{67}$ bilayer in contrast to Pt/NiFe$_2$O$_x$ ($x > 0$) samples. This is verified by the investigation of static magnetic proximity effects via x-ray resonant magnetic reflectivity.

    Ferritin H deficiency deteriorates cellular iron handling and worsens Salmonella typhimurium infection by triggering hyperinflammation

    Iron is an essential nutrient for mammals as well as for pathogens. Inflammation-driven changes in systemic and cellular iron homeostasis are central to host-mediated antimicrobial strategies. Here, we studied the role of the iron storage protein ferritin H (FTH) in the control of infections with the intracellular pathogen Salmonella enterica serovar Typhimurium by macrophages. Mice lacking FTH in the myeloid lineage (LysM-Cre+/+Fthfl/fl mice) displayed impaired iron storage capacities in the tissue leukocyte compartment, increased levels of labile iron in macrophages, and an accelerated macrophage-mediated iron turnover. Under steady-state conditions, LysM-Cre+/+Fth+/+ and LysM-Cre+/+Fthfl/fl animals showed comparable susceptibility to Salmonella infection, whereas i.v. iron supplementation drastically shortened survival of LysM-Cre+/+Fthfl/fl mice. Mechanistically, these animals displayed increased bacterial burden, which contributed to uncontrolled triggering of NF-κB and inflammasome signaling and development of cytokine storm and death. Importantly, pharmacologic inhibition of the inflammasome and IL-1β pathways reduced cytokine levels and mortality and partly restored infection control in iron-treated ferritin-deficient mice. These findings uncover incompletely characterized roles of ferritin and cellular iron turnover in myeloid cells in controlling bacterial spread and in modulating NF-κB and inflammasome-mediated cytokine activation, which may be of vital importance in iron-overloaded individuals suffering from severe infections and sepsis.

    Performance of the first prototype of the CALICE scintillator strip electromagnetic calorimeter

    A first prototype of a scintillator strip-based electromagnetic calorimeter was built, consisting of 26 layers of tungsten absorber plates interleaved with planes of 45 × 10 × 3 mm³ plastic scintillator strips. Data were collected using a positron test beam at DESY with momenta between 1 and 6 GeV/c. The prototype's performance is presented in terms of the linearity and resolution of the energy measurement. These results represent an important milestone in the development of highly granular calorimeters using scintillator strip technology. This technology is being developed for a future linear collider experiment, aiming at the precise measurement of jet energies using particle flow techniques.

    Resource-aware Research on Universe and Matter: Call-to-Action in Digital Transformation

    Given the urgency to reduce fossil fuel energy production to make climate tipping points less likely, we call for resource-aware knowledge gain in the research areas on Universe and Matter with emphasis on the digital transformation. A portfolio of measures is described in detail and then summarized according to the timescales required for their implementation. The measures will both contribute to sustainable research and accelerate scientific progress through increased awareness of resource usage. This work is based on a three-day workshop on sustainability in digital transformation held in May 2023. Comment: 20 pages, 2 figures, publication following workshop 'Sustainability in the Digital Transformation of Basic Research on Universe & Matter', 30 May to 2 June 2023, Meinerzhagen, Germany, https://indico.desy.de/event/3748